Week 1: Introduction to Open Data Science

I am thrilled to dig into the world of open science and learn new tricks with the fascinating open source materials.

A link to my GitHub repository is here!

“Two things are infinite: the universe and human stupidity; and I’m not sure about the universe.” - Albert Einstein


Week 2: Regression and model validation

Description of the exercise phases and interpretation of the results

Part 1: Data

Data description

This exercise focuses on data wrangling using an original dataset called “JYTOPKYS2” by Kimmo Vehkalahti (more information here): creating a subset of the dataset and analysing it with basic methods as well as with regression analysis, which is this week’s topic.

The dataset is called “learning2014”, and it consists of 166 observations
in 7 columns, describing each participant’s learning habits.

Here are the first six rows and the structure of the dataset.

head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Column names describe the following:
1. gender - Gender of the person
2. age - Age of the person
3. attitude - Global attitude toward statistics
4. deep - Combination of different questions regarding “deep learning”, as in seeking meaning, relating ideas, use of evidence in the studies (reference)
5. stra - Combination of different questions regarding “strategic learning”, as in organized studying and time management (reference)
6. surf - Combination of different questions regarding “surface learning”, as in lack of purpose, unrelated memorising and syllabus-boundness (reference)
7. points - Points from the statistics exam

From the structure we can see that six columns hold numeric or integer values, and one (gender) is a factor with the levels “F” and “M”.


Structure of the variables

The following code blocks show summaries of each variable: the gender distribution (gender) and descriptive statistics for the numeric values. There are almost twice as many female participants as male participants, the mean age is 25.5 years, the average attitude is around 3.1 (scale 1-5), and the average values for deep, strategic and surface learning lie between 2.8 and 3.7. The average exam score is 22.7 points, with a minimum of 7 and a maximum of 33.

summary(learning2014$gender)
##   F   M 
## 110  56
summary(learning2014$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00
summary(learning2014$attitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   2.600   3.200   3.143   3.700   5.000
summary(learning2014$deep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   3.333   3.667   3.680   4.083   4.917
summary(learning2014$stra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.188   3.121   3.625   5.000
summary(learning2014$surf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   2.417   2.833   2.787   3.167   4.333
summary(learning2014$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

Relationship between the variables

To see the relationships between these variables, we will combine scatter plots of all variable pairs into a scatter plot matrix.

# This matrix plot is created using the ggplot2 and GGally libraries
library(ggplot2)
library(GGally)
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20)))

By eyeballing the plot, we can see that the strongest positive correlation is between ‘points’ and ‘attitude’, and the strongest negative correlation is between ‘surface learning’ and ‘deep learning’, which is unsurprising since they measure opposing approaches to studying.

The plots also show that the distribution of exam results is quite similar between female and male students, but that female participants have chosen higher values in surface learning, while male participants generally report a better attitude.
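The visual impression can also be checked numerically; a quick sketch using cor() on the numeric columns:

```r
# Pairwise correlations of the numeric variables (rounded for readability)
round(cor(learning2014[, c("attitude", "deep", "stra", "surf", "points")]), 2)
```

The largest absolute values in the points row should match the pairs picked out from the plot above.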


Part 2: Regression

Regression analysis

In this part a regression analysis of the dataset will be conducted. The analysis models the relationship between the dependent variable (in this case points) and chosen explanatory variables, and seeks to determine whether the explanatory variables explain the dependent variable. The results will show whether the relationships are statistically significant.

1. Choosing the explanatory variables

  • As can be interpreted from the scatter plot matrix above, the three variables with the highest correlations with the dependent variable points are attitude, stra and surf. Therefore we will next create a multiple regression model.
  • The formula is of the form y ~ x, where y is the target variable and x the explanatory variable. In a multiple regression model, additional explanatory variables are joined with +, as below:

y ~ x1 + x2 + x3

2. Fitting the regression model

In the regression model, exam score (points) is the target variable (y) and the three variables chosen above are the explanatory variables. In R, a regression model is fitted with the function lm():

model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Interpretation of the summary of the fitted model:

  • According to the summary table, only attitude is a statistically significant explanatory variable for the dependent variable.

  • The residuals (differences between the observed values and the fitted regression line) range from -17.15 to 10.89 with a median of 0.52, and the residual standard error is 5.296.

  • The ‘estimates’ show the effect of each variable on points: for example, a one-unit increase in attitude is associated with about 3.4 additional exam points.
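As a sketch of what the estimates mean in practice, the fitted model can be used to predict the score of a hypothetical student (the attitude/stra/surf values below are made up for illustration):

```r
# Predicted exam points for a hypothetical student, using the fitted model
new_student <- data.frame(attitude = 3.5, stra = 3.0, surf = 2.5)
predict(model, newdata = new_student)
# Same as computing by hand from the coefficient table:
# 11.0171 + 3.3952 * 3.5 + 0.8531 * 3.0 - 0.5861 * 2.5  (about 24 points)
```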

3. Creating a second model based on statistical significance

Since there is only one statistically significant variable, the regression model will be fitted again only with that variable:

model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
  • Now we can see that the standard error of the attitude estimate has decreased a little, and the estimate itself has increased a little.
  • The p-values have also decreased for both the intercept and the attitude variable, suggesting that the statistical significance has increased.

  • The multiple R-squared in the summary describes how close the data are to the fitted regression line: it is the percentage of the variation in the response variable that is explained by the linear model. R-squared is always between 0 and 100 %, where 0 % means the model explains none of the variability of the response data around its mean, and 100 % means it explains all of it.
  • In this ‘model2’, the multiple R-squared is 0.1906 = 19 %, i.e. the model explains about 19 % of the variance in exam points.
  • Low values don’t necessarily mean that the model is useless - as we saw, the relationship is statistically significant. It is said that any field that attempts to explain human behavior typically has R-squared values below 50 %.
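To make the definition concrete, the multiple R-squared of ‘model2’ can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares:

```r
# R-squared = 1 - SSE/SST: the share of the variance in points explained by the model
sse <- sum(residuals(model2)^2)                                   # residual sum of squares
sst <- sum((learning2014$points - mean(learning2014$points))^2)   # total sum of squares
1 - sse / sst   # should match the Multiple R-squared (0.1906) in the summary
```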

Part 3: Creating diagnostic plots

Diagnostic plots

Diagnostic plots are made to validate the assumptions of the regression model. In R this is easy to do with the basic plot() function by giving the model as the first argument. Assumptions are always part of a model, and how well the model describes the phenomenon of interest depends on how well the assumptions fit reality.


We will create plots for Normal QQ-plot, Residuals vs Fitted values and Residuals vs Leverage.

  1. The QQ-plot explores the assumption that the errors of the model are normally distributed.
plot(model2, 2)

  • We can see that the majority of the points follow the line; however, after the x-value 1 the curve clearly turns a bit lower on the y-axis, making it dip a little. I would still interpret the fit as somewhat reasonable.

  2. The Residuals vs Fitted plot checks the assumption of constant error variance: the size of the errors
    should not depend on the fitted values. The more randomly scattered the values, the more reasonable the assumption.
plot(model2, 1)

  • From the plot we can see that the residuals are spread quite evenly and randomly throughout the plot and are not concentrated at any particular fitted value, so the assumption seems reasonable.

  3. The Residuals vs Leverage plot compares the residuals to leverage, i.e. the impact of a single observation on the model, and helps to find influential outliers. Cases outside the Cook’s distance lines (meaning they have high Cook’s distance scores) are influential on the regression results.
plot(model2, 5)

  • In the plot above we can see that no individual cases stand out from the graph; almost all cases are inside the Cook’s distance lines.
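The three diagnostic plots above can also be drawn into a single figure by passing the ‘which’ argument to plot():

```r
# Residuals vs Fitted (1), Normal Q-Q (2) and Residuals vs Leverage (5) side by side
par(mfrow = c(1, 3))
plot(model2, which = c(1, 2, 5))
par(mfrow = c(1, 1))   # reset the plotting layout
```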

Conclusion

According to the data and regression analyses and model validation, we could conclude that the model created is valid for studying the relationship between the exam points and students’ attitudes. Lastly, here is a plot visualizing the linear relation between these two variables.

# Initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points, col = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Student's attitude versus exam points")

# Draw the plot
p1



Week 3: Logistic regression

Description of the exercise phases and interpretation of the results

Part 1: Data

Data description

Data for this week’s exercise is retrieved from the UCI Machine Learning Repository; it was collected by Paulo Cortez (University of Minho, Portugal) in 2008. The data consists of two joined dataframes about students’ achievements in secondary education in two Portuguese schools: one table for students taking the math course and one for students taking the Portuguese course. For this exercise, the two datasets have been joined together (the code for the data wrangling operations is here).

Glimpse of the data:

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason nursery internet guardian traveltime studytime failures schoolsup famsup paid activities higher romantic famrel freetime goout Dalc Walc health absences G1 G2 G3 alc_use high_use
GP F 18 U GT3 A 4 4 at_home teacher course yes no mother 2 2 0 yes no no no yes no 4 3 4 1 1 3 5 2 8 8 1.0 FALSE
GP F 17 U GT3 T 1 1 at_home other course no yes father 1 2 0 no yes no no yes no 5 3 3 1 1 3 3 7 8 8 1.0 FALSE
GP F 15 U LE3 T 1 1 at_home other other yes yes mother 1 2 2 yes no yes no yes no 4 3 2 2 3 3 8 10 10 11 2.5 TRUE
GP F 15 U GT3 T 4 2 health services home yes yes mother 1 3 0 no yes yes yes yes yes 3 2 2 1 1 5 1 14 14 14 1.0 FALSE

Column names for the dataset are:

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Structure of the data is the following:
Observations: 382
Variables: 35

summary(alc)
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##  NA's :0        
##                 
## 

Information about the attributes:
- school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
- sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
- age - student’s age (numeric: from 15 to 22)
- address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
- famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
- Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
- Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
- guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - with a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:
- G1 - first period grade (numeric: from 0 to 20)
- G2 - second period grade (numeric: from 0 to 20)
- G3 - final grade (numeric: from 0 to 20, output target)

These variables were created when combining the dataframes, to describe alcohol consumption:
- alc_use - average alcohol use on weekdays and weekends together
- high_use - high alcohol use (alc_use > 2), logical (T/F)
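The linked wrangling script contains the actual code; a minimal sketch of how these two variables were presumably derived (assuming dplyr):

```r
library(dplyr)
# alc_use: average of workday (Dalc) and weekend (Walc) consumption
# high_use: TRUE when the average exceeds 2
alc <- alc %>%
  mutate(alc_use = (Dalc + Walc) / 2,
         high_use = alc_use > 2)
```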


Part 2: Data Analysis

The purpose of this analysis is to study the relationships between high and low alcohol consumption and four other chosen variables, and to show the distributions of the variables numerically and graphically.
Logistic regression is used to fit a linear model for the binary value “success”, i.e. the expected value of the target variable.
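As a reminder of how the model works: logistic regression is linear on the log-odds scale, and the log-odds are mapped to a probability with the logistic function. A small illustrative sketch (the log-odds value below is made up):

```r
# The logistic link maps any log-odds value to a probability in (0, 1)
log_odds <- 0.4                    # hypothetical value of the linear predictor
prob <- 1 / (1 + exp(-log_odds))   # logistic function; equivalently plogis(log_odds)
prob                               # about 0.6
```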

1. Choosing the four variables

The dataset is quite interesting, because it provides multiple socially interesting factors to compare with each other. The four variables I chose to compare with alcohol consumption are health, G3 (final grades), studytime and absences. Research generally concludes that men exceed women in high-volume drinking, but it also suggests a clear female excess in the risk of becoming an underage drinker. That is why I include the sex variable in my analysis as well.

2. My personal hypotheses for each variable

  • health: Alcohol affects human health, so logically high consumption would affect health status, and students who consume less alcohol would be healthier. I believe there might be a connection, although other factors, such as eating habits and sports, probably matter more.
  • G3 (grades): I believe there might be some connection - students who consume more alcohol tend to get lower grades - but that the difference is not that big.
  • studytime: I believe some correlation can be seen between alcohol consumption and study time.
  • absences: I believe high alcohol consumption leads to more school absences.

3. Distributions of variables to alcohol consumption

Comparison of predictive variables with the target variable (high_use):

# First plot: Distributions of high alcohol consumption and health status with gender distinction
g1 <- ggplot(alc, aes(x = high_use, y = health, col=sex)) + geom_boxplot() + ggtitle("Health status vs. alcohol consumption")  
       
# Second plot: Distributions of high alcohol consumption and G3(grades) with gender distinction
g2 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("grade") + ggtitle("Grades vs. alcohol consumption")  

# Third plot: Distributions of high alcohol consumption and study time with gender distinction
g3 <- ggplot(alc, aes(x = high_use, y = studytime, col = sex)) + geom_boxplot() + ylab("Study time") + ggtitle("Study time vs. alcohol consumption")

# Fourth plot: Distributions of high alcohol consumption and school absences with gender distinction
g4 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + ylab("absence") + ggtitle("Absences vs. alcohol consumption")

# Combine the four plots into one figure (multiplot() is a custom helper, not base R)
multiplot(g1, g2, g3, g4, cols = 2)

Summary statistics:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_health = mean(health), mean_studytime = mean(studytime), mean_grade = mean(G3), mean_absence = mean(absences))
## # A tibble: 4 x 7
## # Groups:   sex [?]
##   sex   high_use count mean_health mean_studytime mean_grade mean_absence
##   <fct> <lgl>    <int>       <dbl>          <dbl>      <dbl>        <dbl>
## 1 F     FALSE      156        3.38           2.34       11.4         4.22
## 2 F     TRUE        42        3.40           2          11.7         6.79
## 3 M     FALSE      112        3.71           1.88       12.2         2.98
## 4 M     TRUE        72        3.88           1.64       10.3         6.12

Interpretation

As the data was grouped by gender, we can see results for both male and female high and low alcohol consumers.

  • In plot 1 (upper left), health status is generally the same for male and female students, with a drop in the median value among low-alcohol consuming female students, which is slightly surprising. Male students have exactly the same range of answers and the same median value of 4, whereas the answers of high-alcohol consuming female students are more dispersed. The summary table shows that the mean health statuses are quite similar for both sexes, but high-alcohol consuming students have rated their health slightly better than low-alcohol consumers.

  • In plot 2 (upper right), study times differ more between the sexes than between the consumption groups. My hypothesis of a correlation between alcohol consumption and study time therefore does not quite hold, although the statistics show that boys study less overall, and that both boys and girls who consume more alcohol study a bit less on average.

  • In plot 3 (lower left), we can see that grades are quite equally distributed among low-alcohol consumers. Among high-alcohol consumers, girls generally get slightly better grades than boys, and only boys show clear outliers (with bad grades) in both the low- and high-consumption groups. So, in line with my hypothesis, alcohol consumption has an effect on final grades, but it is probably not the only explanatory factor.

  • In plot 4 (lower right), we can see that girls in both groups have more outlier individuals with many absences. As for the statistics, high-alcohol consuming students have more absences on average in both sex groups.


Part 3: Logistic regression

Logistic regression to find relationships between variables

First we will create a logistic regression model and study its summary and coefficients.

m <- glm(high_use ~ sex + health + studytime + G3 + absences , data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + health + studytime + G3 + absences, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1871  -0.8325  -0.6075   1.0677   2.2106  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.42587    0.64922  -0.656 0.511845    
## sexM         0.81236    0.25409   3.197 0.001388 ** 
## health       0.03196    0.08731   0.366 0.714366    
## studytime   -0.35579    0.16124  -2.207 0.027344 *  
## G3          -0.06155    0.03665  -1.679 0.093089 .  
## absences     0.08642    0.02299   3.759 0.000171 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 420.17  on 376  degrees of freedom
## AIC: 432.17
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)        sexM      health   studytime          G3    absences 
## -0.42586809  0.81236413  0.03195670 -0.35578595 -0.06155365  0.08642112
  • We can see that the explanatory variables, in order of statistical significance, are absences, sex and studytime; grades (G3) is only marginally significant (p ≈ 0.09).

  • Health is clearly insignificant, and the p-value for grades is greater than 0.05, so I will exclude both from the next model and fit it again.

Next, let’s look at the coefficients of the refitted model as odds ratios:

m <- glm(high_use ~ sex + studytime + absences , data = alc, family = "binomial")
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# Computing confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# Printing out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.3993889 0.1760458 0.8977721
## sexM        2.2216751 1.3647760 3.6532630
## studytime   0.6686620 0.4847669 0.9067963
## absences    1.0928234 1.0467369 1.1459745
  • The larger the odds ratio, the stronger the positive association. We see that sex and absences have the highest odds ratios, and that sex also has the widest confidence interval.

Part 4: Predictions

In this part the predictive power of the model will be explored.

# Fitting the model with statistically significant variables
m <- glm(high_use ~ sex + absences + studytime, data = alc, family = "binomial")

# Predict() the probability of high_use
probabilities <- predict(m, type = "response")

# Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probabilities > 0.5)

# Tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>% prop.table %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67539267 0.02617801 0.70157068
##    TRUE  0.23036649 0.06806283 0.29842932
##    Sum   0.90575916 0.09424084 1.00000000
ggplot(alc, aes(x = probability, y = high_use, col=prediction)) + geom_point()

  • From the plot and the table we see that the model predicts fewer students in the high-consumption category than there actually are.
  • Models are never perfect, since there is always a chance of incorrectly classified observations. The model’s average proportion of wrong predictions can be computed with a loss function. The smaller the value (on a scale from 0 to 1), the better, since it means fewer errors.
# Defining a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Calling loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, alc$probability)
## [1] 0.2565445
  • 0.2565 is the proportion of incorrect predictions in the training data, computed with the loss function.

EXTRA:
When conducting a 10-fold cross-validation of the model, we get the following results:

library(boot)
# K-fold cross-validation with K = 10 folds
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2617801
  • 10-fold cross-validation divides the data into 10 subsets; in each round, K-1 = 9 subsets are used for training (fitting the model) and one for testing (making predictions on data not used in fitting), and this is repeated ten times. The resulting proportion of incorrectly classified observations is 0.2617801, slightly greater than the training error from the loss function above, as expected for an estimate on unseen data.

Week 4: Clustering and classification

Description of the exercise phases and interpretation of the results

Part 1: Data

Data description

This week’s exercise introduces classification and clustering as visual tools to explore statistical data.
The data used is a readily available R dataset in the MASS library called Boston. It contains data about housing values in Boston, USA. More information about the data can be found here.

Column descriptions:

crim = per capita crime rate by town.
zn = proportion of residential land zoned for lots over 25,000 sq.ft.
indus = proportion of non-retail business acres per town.
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox = nitrogen oxides concentration (parts per 10 million).
rm = average number of rooms per dwelling.
age = proportion of owner-occupied units built prior to 1940.
dis = weighted mean of distances to five Boston employment centres.
rad = index of accessibility to radial highways.
tax = full-value property-tax rate per $10,000.
ptratio = pupil-teacher ratio by town.
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat = lower status of the population (percent).
medv = median value of owner-occupied homes in $1000s.

data(Boston)
str(Boston) 
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

As we can see the data consists of 506 observations and 14 variables.

In the next graph we can see the correlations between the variables, with significance marked at the 0.05 level.

cor_matrix <- cor(Boston) %>% round(2)
res1 <- cor.mtest(Boston, conf.level = .95)
corrplot(cor_matrix, p.mat = res1$p, method = "color", type = "upper",
         sig.level = c(.001, .01, .05), pch.cex = .9,
         insig = "label_sig", pch.col = "white", order = "AOE")

As we can see, the darker the red colour, the stronger the positive correlation between two variables, and the darker the blue colour, the stronger the negative correlation. Almost all correlations are significant at the 0.05 significance level. Among the strongest correlations, according to the matrix, are those between distance to employment centres and both age and nitrogen oxides concentration, and between lower status (%) and the median value of owner-occupied homes, to name a few.
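The correlations shown in the plot are Pearson correlations; for two columns the computation can be sketched in plain Python (a minimal version of the quantity cor() computes, not its implementation):

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# a perfectly linear relationship gives correlation 1.0
r = pearson([1, 2, 3], [2, 4, 6])
```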

Part 2: Scaling and dividing the data

At this point the data will be scaled for later use. Scaling is necessary for the upcoming linear discriminant analysis, because LDA assumes that the variables are normally distributed and share the same variance.

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Scaling changed the ranges of the variables; as we can see, the minimum and maximum values now differ from the originals. With default settings, the scale() function computes the mean and standard deviation of each column and then standardizes every element by subtracting the mean and dividing by the standard deviation. This is why, for example, the 'age' variable now has negative values.
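The standardization step that scale() performs can be sketched in plain Python (a minimal illustration of the computation, not the R implementation):

```python
from statistics import mean, stdev

def z_scale(column):
    """Standardize: subtract the column mean, divide by the sample standard deviation."""
    m, s = mean(column), stdev(column)  # stdev uses the n - 1 denominator, like scale()
    return [(x - m) / s for x in column]

ages = [53, 55, 49, 53, 49, 38]
scaled_ages = z_scale(ages)
# values below the column mean come out negative, as seen for 'age' above
```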

Next I will create a categorical variable from the scaled crime rate. Summary of the scaled crim variable:

# First converting the matrix into a dataframe
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000

The break points will be the quartiles of the variable. After creating the new variable, I will remove the old 'crim' variable and replace it with the categorical 'crime'; the table below shows the number of cases in each quartile group.

# Creating a quantile vector to use as break points
bins <- quantile(boston_scaled$crim)
# Creating the categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))  
# Table of the new crime variable
table(crime)  
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# Removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# Adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
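The quartile binning above can be sketched in plain Python (an illustrative helper, not the cut() implementation; statistics.quantiles with method="inclusive" should match R's default quantile type):

```python
from statistics import quantiles

def quartile_bin(data, labels=("low", "med_low", "med_high", "high")):
    """Assign each value to a quartile group, with right-closed intervals like cut()."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    def label(x):
        if x <= q1:
            return labels[0]
        if x <= q2:
            return labels[1]
        if x <= q3:
            return labels[2]
        return labels[3]
    return [label(x) for x in data]

groups = quartile_bin([1, 2, 3, 4, 5, 6, 7, 8])  # two values per group
```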

Training and testing set:

Splitting the data into training and testing sets lets us check how well our model performs. The model is trained with the training set (80 % of the dataset) and then tested by predicting the classes of the testing set. This shows how well the model, for example, classifies observations into groups.

Next I will divide the dataset into train and test sets. The function of each step is commented in the code block.

# Choosing 80 % of the rows at random as the training set (a seed could be set for reproducibility)
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
# Testing set is the data minus the training set
test <- boston_scaled[-ind,]
# Saving the actual correct classes into a new variable
correct_classes <- test$crime
# Removing the crime variable from test data
test <- dplyr::select(test, -crime)
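The 80/20 split logic can be sketched with toy row indices in plain Python (illustrative only; the helper name is made up):

```python
import random

def split_rows(rows, train_share=0.8, seed=2017):
    """Pick a random train_share of the rows for training; the rest form the test set."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    chosen = set(rng.sample(range(len(rows)), k=int(len(rows) * train_share)))
    train = [r for i, r in enumerate(rows) if i in chosen]
    test = [r for i, r in enumerate(rows) if i not in chosen]
    return train, test

train, test = split_rows(list(range(10)))  # 8 training rows, 2 test rows
```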

Part 3: Linear discriminant analysis

LDA is a classification method for a categorical target variable (like the crime variable) with two or more classes. It is used to find patterns within the data and to classify observations effectively.

LDA is used to predict classes for new data and to find the variables that discriminate or separate the classes best (DataCamp). The difference between classification and clustering is that in classification the classes are known: the model is trained with a training set drawn from the data and then classifies new observations into those classes. In clustering, on the other hand, the classes are unknown, and the data is grouped based on the similarity of the observations. If the assumptions of discriminant analysis are met, it is more powerful than logistic regression, but the assumptions are rarely met.

Linear discriminant analysis for training data where crime is the target and all other variables as predictors:

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2524752 0.2351485 0.2599010 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       0.9125722 -0.9387150 -0.11793298 -0.8673105  0.4736813
## med_low  -0.1093329 -0.3488093  0.03646311 -0.5688043 -0.1690836
## med_high -0.3771136  0.1873251  0.10065938  0.4162256  0.1130320
## high     -0.4872402  1.0170492 -0.15984049  1.0326176 -0.4020019
##                 age        dis        rad        tax    ptratio
## low      -0.8856068  0.8641002 -0.6834950 -0.7579752 -0.4531402
## med_low  -0.3458643  0.3104087 -0.5472552 -0.4732297 -0.1080697
## med_high  0.3799665 -0.3934712 -0.3713703 -0.2805897 -0.3572507
## high      0.8077246 -0.8400902  1.6388211  1.5145512  0.7815834
##                black       lstat         medv
## low       0.38002361 -0.79485416  0.557351222
## med_low   0.32402858 -0.11904702  0.006026785
## med_high  0.08848333 -0.02028448  0.164792604
## high     -0.83977741  0.83487436 -0.703580334
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.09209741  0.70815743 -0.93258782
## indus    0.06436257 -0.42509973 -0.07768853
## chas    -0.10524684 -0.03206491  0.18614819
## nox      0.35361675 -0.66730372 -1.30335811
## rm      -0.08297439 -0.07536678 -0.27900932
## age      0.28570244 -0.26842679 -0.14145173
## dis     -0.05357907 -0.21937735  0.04385065
## rad      2.97577488  1.01266314 -0.40547138
## tax      0.01657693 -0.09476859  1.00152467
## ptratio  0.07894599  0.08215838 -0.22638286
## black   -0.12801408 -0.02286222  0.08739739
## lstat    0.21013320 -0.16474776  0.41952427
## medv     0.14909642 -0.37622721 -0.14332686
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9450 0.0388 0.0162

The LDA function calculates the group means for each variable, the coefficients of the linear discriminants, and the proportions of trace (the proportion of between-class variance explained by each successive discriminant function). Here there are three linear discriminants, since the crime variable has four classes and the number of discriminants is the number of classes minus one.

# Creating a biplot to visualize the LDA
# Creating an arrow function (DataCamp)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Converting the target classes to numeric
classes <- as.numeric(train$crime)
# Plotting the LDA results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The next phase is to fit the testing data to the LDA and predict the classes for the observations. Since the correct values are stored in the correct_classes variable, I will cross-tabulate the predicted values against the correct ones to see how well the classifier performed.

# Predicting the classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# Cross-tabulating the correct classes and the predicted classes (lda.pred$class)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      10        0    0
##   med_low    6      13        5    0
##   med_high   1      14       16    0
##   high       0       0        0   22
  • All 22 high values were correctly predicted
  • Of the 25 low values, 10 were misclassified as med_low
  • Of the 24 med_low values, 6 were classified as low and 5 as med_high
  • About half of the med_high values were correctly classified; 14 ended up in the med_low class and one in the low class

The model is best at classifying high values; the lower the crime rate class, the less accurate the prediction.
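From the cross-tabulation above we can also compute the overall accuracy as the diagonal sum over the total; a quick check in Python using the counts from the table:

```python
# rows: correct class, columns: predicted class (low, med_low, med_high, high)
confusion = [
    [15, 10,  0,  0],  # low
    [ 6, 13,  5,  0],  # med_low
    [ 1, 14, 16,  0],  # med_high
    [ 0,  0,  0, 22],  # high
]
correct = sum(confusion[i][i] for i in range(4))  # 66 correctly classified
total = sum(sum(row) for row in confusion)        # 102 test observations
accuracy = correct / total                        # roughly 0.65
```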

Part 4: Distance measures and clustering

Distance measures are used to quantify how similar or dissimilar observations are. K-means is one common clustering method and is easy to use: the algorithm is updated until the centroids and cluster assignments no longer change.

In order to find clusters in the dataset, I will reload the Boston dataset and standardize it. Then I will calculate the distances between the observations and use k-means to find the optimal number of clusters. The optimal number is determined with the total within-cluster sum of squares (WCSS), obtained by computing the WCSS of every cluster and summing them; the optimal number of clusters is where the total WCSS drops radically.
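The total within-cluster sum of squares can be illustrated with a one-dimensional toy example in plain Python (a sketch of the quantity k-means minimizes, not of the algorithm itself):

```python
def wcss(clusters):
    """Total within-cluster sum of squares for 1-D clusters (lists of numbers)."""
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((p - centroid) ** 2 for p in points)
    return total

# two well-separated groups: splitting them into two clusters makes the WCSS drop radically
data = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]
one_cluster = wcss([data])
two_clusters = wcss([data[:3], data[3:]])
```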

Comments are in the codeblock.

# Loading the Boston dataset
data(Boston)

# Scaling the data to get comparable distances
scaled <- scale(Boston)

# Calculating the Euclidean distances for the dataset
# (note: computed on the unscaled data here; dist(scaled) would use the standardized copy)
dist_eu <- dist(Boston)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.620 170.500 226.300 371.900 626.000
# Calculating also the distance with "Manhattan"-method
dist_man <- dist(Boston, method = "manhattan")
summary(dist_man)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.016  149.100  279.500  342.900  509.700 1198.000

As we can see, the Manhattan method produces larger distances. Next we perform the k-means clustering and find the optimal number of clusters. Comments are in the code block.

library (ggplot2)
set.seed(123)

# First setting the maximum number of clusters to try
k_max <- 10

# Calculating the total within-cluster sum of squares for each k (again on the unscaled data)
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# Visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') 

# We see that 2 clusters is a good amount (the point where the line rapidly drops)
# K-means clustering
km <- kmeans(Boston, centers = 2)

# Plotting the Boston dataset with clusters, taking columns 6 to 10 for reference
pairs(Boston[6:10], col = km$cluster)  

Here we see that the data is classified into two clusters, and that the groups are quite separate from each other, apart from some overlapping points (for example in the 'rad' panels and in the 'dis', 'age' and 'rm' variables).


Week 5: Dimensionality reduction techniques

Description of the exercise phases and interpretation of the results

Part 1: Data

Data description

This week’s exercise is about reducing the dimensionality of multidimensional data. The dataset originates from the United Nations Development Programme and, after wrangling, includes 155 observations from different countries and 8 variables. The original data covers the following measures:

HDI rank, HDI, life expectancy, expected years of education, mean years of education, gross national income (GNI), GNI rank minus HDI rank, Gender Inequality Index (GII) rank, GII, maternal mortality, adolescent birth rate, share of females in parliament, secondary education for females and males, labour force participation for females and males, and the female-to-male ratios of secondary education and labour force participation.

The data description can be found here and the techniques for calculating each of these variables can be found here.

Graphical overview and data structure:

# Structure of the data
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

All variables are numeric. It is interesting that the maximum representation of females in parliament is 57.50 % while the minimum is zero… The next two graphs show the distributions of the variables and the correlations between them.

# Visualizing variables
ggpairs(human)

cor(human)
##                Edu2.FM      Labo.FM   Life.Exp     Edu.Exp         GNI
## Edu2.FM    1.000000000  0.009564039  0.5760299  0.59325156  0.43030485
## Labo.FM    0.009564039  1.000000000 -0.1400125  0.04732183 -0.02173971
## Life.Exp   0.576029853 -0.140012504  1.0000000  0.78943917  0.62666411
## Edu.Exp    0.593251562  0.047321827  0.7894392  1.00000000  0.62433940
## GNI        0.430304846 -0.021739705  0.6266641  0.62433940  1.00000000
## Mat.Mor   -0.660931770  0.240461075 -0.8571684 -0.73570257 -0.49516234
## Ado.Birth -0.529418415  0.120158862 -0.7291774 -0.70356489 -0.55656208
## Parli.F    0.078635285  0.250232608  0.1700863  0.20608156  0.08920818
##              Mat.Mor  Ado.Birth     Parli.F
## Edu2.FM   -0.6609318 -0.5294184  0.07863528
## Labo.FM    0.2404611  0.1201589  0.25023261
## Life.Exp  -0.8571684 -0.7291774  0.17008631
## Edu.Exp   -0.7357026 -0.7035649  0.20608156
## GNI       -0.4951623 -0.5565621  0.08920818
## Mat.Mor    1.0000000  0.7586615 -0.08944000
## Ado.Birth  0.7586615  1.0000000 -0.07087810
## Parli.F   -0.0894400 -0.0708781  1.00000000
# Creating a correlation matrix and visualizing the correlations
res1 <- cor.mtest(human, conf.level = .95)
cor(human) %>% corrplot(p.mat = res1$p, method = "color", type = "upper",
         sig.level = c(.001, .01, .05), pch.cex = .9,
         insig = "label_sig", pch.col = "white", order = "AOE")

Most of the variables are not normally distributed: there is positive skewness for example in maternal mortality, adolescent birth rate and GNI, and negative skewness in the labour force ratio and life expectancy. Quite many of the variables correlate with each other rather strongly, which is plausible given that the variables are well known and straightforward. The biggest positive correlations are between maternal mortality and adolescent birth rate, and between life expectancy and expected years of education; a strong negative correlation appears between maternal mortality and life expectancy. As a single variable, adolescent birth rate (also referred to as the age-specific fertility rate for women aged 15–19) correlates notably with almost all the others: the higher the fertility rate for women under 19, the lower the life expectancy and education, and the higher the risk of maternal mortality. From the correlation matrix we can see that almost all correlations are statistically significant at the 95 % confidence level.

Part 2: Principal Component Analysis (PCA)

In this part of the analysis I will perform PCA first on the non-standardized data and then on the standardized data. PCA is an unsupervised statistical procedure and one of the dimensionality reduction techniques, which are used to reduce the “noise” and error in the data and to find the phenomenon of interest by focusing only on the essential dimensions. In PCA the data is first transformed to another ‘space’ with the same number of dimensions as the original data. The first principal component captures the maximum amount of variance in the original dataset, the second captures the most of what the first one did not, the third what the second did not, and so on, yielding uncorrelated components. Therefore the first few principal components are usually sufficient to represent the data.
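The "Proportion of Variance" figures that prcomp() reports are simply each component's variance divided by the total; a minimal Python sketch with made-up standard deviations:

```python
def variance_proportions(sds):
    """Share of total variance explained by each component, given their standard deviations."""
    variances = [s ** 2 for s in sds]
    total = sum(variances)
    return [v / total for v in variances]

# hypothetical component sds: the first component dominates
props = variance_proportions([2.0, 1.0, 1.0])  # 4/6, 1/6, 1/6
```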

First PCA on non standardized data:

# Performing PCA on non standardized human data
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# Saving the summary to extract the variance proportions
s <- summary(pca_human)
# Drawing a biplot of the PC1 and PC2
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Without any standardization the PCA performs quite badly, as we see from the plot above. PCA projects the data onto the directions of maximum variance; without rescaling, PCA loads on the variables with the largest variances, and therefore a single component explains practically all of the variance (PC1 ~100 % vs. PC2 0 %). In the plot we see that the data has indeed been loaded almost entirely on the first component, and the result shows nothing meaningful. Only Qatar and a few other countries stand out. The arrows point in the directions of the original variables; the GNI arrow stands out substantially because its length is proportional to the variable's standard deviation, and it is the only arrow pointing towards the second PC.

Next, PCA on the standardized data, where each variable has mean 0:

# Standardizing data
human_std <- scale(human)
# Performing PCA on standardized human data
pca_human <- prcomp(human_std)
summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# Saving the summary for the axis labels
s <- summary(pca_human)
# Drawing a biplot of the PC1 and PC2
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Here the variables have been standardized, so the result no longer depends on the original scales and is easier to interpret. From the table we see that the variance is spread much more evenly across the components. The arrows reflect the correlations between the variables: for example, expected years of education and life expectancy point in nearly the same direction and indeed correlate at 0.79. Maternal mortality and adolescent birth rate are also very highly correlated with each other, and both correlate negatively with expected education and life expectancy, which is why their arrows point to the opposite side of the x-axis. The female/male labour force ratio and female representation in parliament load mostly on the second dimension, implying a rather weak correlation with the first. The angle between an observation and an arrow reflects their association: from the country names one can see that the least developed countries lie close to the maternal mortality and adolescent birth rate arrows, while more developed countries, such as Switzerland, lie near high values of expected years of education. The arrows divide quite evenly between the two dimensions.


Part 3: Multiple Correspondence Analysis

Data is from the R package FactoMineR.

Multiple correspondence analysis (MCA) is also a dimensionality reduction method, used to analyse several categorical variables.

First I will wrangle the data to our liking, keeping only certain columns from the ‘tea’ data. All the variables are categorical factor variables.

require(FactoMineR)
## Loading required package: FactoMineR
data("tea")
tea_time = tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...

Visualization of the variables:

gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill = "red") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Next I’ll perform MCA analysis on the tea_time data:

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

From the summary table we can read the variances and the percentages of variance for each dimension. The Individuals section shows the rows (here only the first 10) with their contributions and squared correlations, and the Categories section the coordinates of the category levels. Five categories have v.test values above 1.96 or below -1.96, which means their coordinates differ significantly from zero. The last box, categorical variables (eta2), shows the links between the dimensions and the variables; the closer to 1, the stronger the link. The strongest links are from the lower-case ‘how’ and ‘where’ variables to Dim.1. Next is a plot of the MCA:

# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

The categories belonging to one variable share the same colour. Quite many categories lie close to each other, and the most notable outliers are ‘tea shop’ and ‘unpackaged’. From the previous table we can see that these two categories belong to the ‘where’ and ‘how’ variables, so as predicted, they are far from zero.

cats <- apply(tea_time, 2, function(x) nlevels(as.factor(x)))

# data frame with variable coordinates
mca1_vars_df = data.frame(mca$var$coord, Variable = rep(names(cats), cats))

# data frame with observation coordinates
mca1_obs_df = data.frame(mca$ind$coord)

# plot of variable categories
ggplot(data = mca1_obs_df, aes(x = Dim.1, y = Dim.2)) +
  geom_hline(yintercept = 0, colour = "gray70") +
  geom_vline(xintercept = 0, colour = "gray70") +
  geom_point(colour = "gray50", alpha = 0.7) +
  geom_density2d(colour = "gray80") +
  geom_text(data = mca1_vars_df, 
            aes(x = Dim.1, y = Dim.2, 
                label = rownames(mca1_vars_df), colour = Variable)) +
  ggtitle("MCA plot of variables using R package FactoMineR") +
  scale_colour_discrete(name = "Variable")

From this plot we can see both observations and categories, as well as density curves to see where the observations are concentrated.


Week 6: Analysis of Longitudinal Data

Description of the exercise phases and interpretation of the results

Part 1: Data

Data description

This week’s theme is longitudinal data, the name for data with repeated measures: if a response variable is measured under different conditions and at different times on the same individual/subject, the data are called longitudinal. Treating the observations as belonging to identified individuals helps in understanding the data, and this data type occurs most frequently in, for example, psychological testing, where a subject is tested repeatedly over a certain period. The data and literature used for this week’s exercise are by Kimmo Vehkalahti (Vehkalahti and Everitt, 2019), and I will explore the datasets with the help of Chapters 8 & 9 of the book.

The first dataset is from a study by Crowder and Hand (1990), in which they examined the nutrition of three groups of rats that were put on different diets; each animal’s body weight in grams was recorded over a nine-week period (Vehkalahti and Everitt, 2019, Chapter 9). Here the column ‘rats’ gives the weight in grams.

The second dataset, BPRS, by Davis (2002), comes from a study in which 40 males were assigned to two different treatments, observed once a week for eight weeks, and given a brief psychiatric rating scale (BPRS) score, which assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia (source: DataCamp).

These two datasets have been converted from wide form into long form.
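The conversion itself was done in the data wrangling script; a minimal sketch of the idea for the BPRS data, using the tidyr gather() interface and a hypothetical `BPRS_wide` data frame with columns treatment, subject and week0…week8:

```r
library(dplyr)
library(tidyr)

# Wide form: one row per subject, one column per week.
# Long form: one row per (subject, week) measurement.
BPRSL <- BPRS_wide %>%
  gather(key = weeks, value = bprs, week0:week8) %>%
  mutate(week = as.integer(substr(weeks, 5, 5)))  # numeric week from "weekX"
```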

Next I present the structure and summaries of the datasets:

# Reading the data

BPRS <- read.csv("~/Documents/GitHub/IODS-project/data/BPRSL.csv", sep = ",", header = TRUE, row.names = 1)
RATS <- read.csv("~/Documents/GitHub/IODS-project/data/RATSL.csv", sep = ",", header = TRUE, row.names = 1)
head(RATS)
##   ID Group time rats Time
## 1  1     1  WD1  240    1
## 2  2     1  WD1  225    1
## 3  3     1  WD1  245    1
## 4  4     1  WD1  260    1
## 5  5     1  WD1  255    1
## 6  6     1  WD1  260    1
head(BPRS)
##   treatment subject weeks bprs week
## 1         1       1 week0   42    0
## 2         1       2 week0   58    0
## 3         1       3 week0   54    0
## 4         1       4 week0   55    0
## 5         1       5 week0   72    0
## 6         1       6 week0   48    0
str(RATS)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ time : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ rats : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time : int  1 1 1 1 1 1 1 1 1 1 ...
str(BPRS)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
summary(RATS)
##        ID            Group           time         rats      
##  Min.   : 1.00   Min.   :1.00   WD1    :16   Min.   :225.0  
##  1st Qu.: 4.75   1st Qu.:1.00   WD15   :16   1st Qu.:267.0  
##  Median : 8.50   Median :1.50   WD22   :16   Median :344.5  
##  Mean   : 8.50   Mean   :1.75   WD29   :16   Mean   :384.5  
##  3rd Qu.:12.25   3rd Qu.:2.25   WD36   :16   3rd Qu.:511.2  
##  Max.   :16.00   Max.   :3.00   WD43   :16   Max.   :628.0  
##                                 (Other):80                  
##       Time      
##  Min.   : 1.00  
##  1st Qu.:15.00  
##  Median :36.00  
##  Mean   :33.55  
##  3rd Qu.:50.00  
##  Max.   :64.00  
## 
summary(BPRS)
##    treatment      subject          weeks          bprs            week  
##  Min.   :1.0   Min.   : 1.00   week0  : 40   Min.   :18.00   Min.   :0  
##  1st Qu.:1.0   1st Qu.: 5.75   week1  : 40   1st Qu.:27.00   1st Qu.:2  
##  Median :1.5   Median :10.50   week2  : 40   Median :35.00   Median :4  
##  Mean   :1.5   Mean   :10.50   week3  : 40   Mean   :37.66   Mean   :4  
##  3rd Qu.:2.0   3rd Qu.:15.25   week4  : 40   3rd Qu.:43.00   3rd Qu.:6  
##  Max.   :2.0   Max.   :20.00   week5  : 40   Max.   :95.00   Max.   :8  
##                                (Other):120

Part 2: RATS

Part 2 examines the RATS data by applying analyses from MABS (Vehkalahti and Everitt, 2019) Chapter 8: Analysis of Longitudinal Data I: Graphical Displays and Summary Measure Approach.

Graphical display

First I will show a graphical display of the individual response profiles before and after standardization.

# Draw the plot RATS
ggplot(RATS, aes(x = Time, y = rats, linetype = as.factor(ID))) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS$rats), max(RATS$rats))) 

We see that in each group almost all individuals follow the same kind of curve. We can also note that there are big weight differences between the groups, increasing by group number, and that there is one significant outlier in the second group. We also note that rats that are heavier in the beginning tend to have higher values at the end as well, a phenomenon called tracking (Vehkalahti and Everitt, 2019, Chapter 8). Tracking can be seen better from the standardized data below.

# Standardizing the weight values for each time point
RATSS <- RATS %>%
  group_by(Time) %>%
  mutate(stdrats = (rats - mean(rats))/sd(rats) ) %>%
  ungroup()
# Plot
ggplot(RATSS, aes(x = Time, y = stdrats, linetype = as.factor(ID))) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

In order to better understand the data, it is useful to calculate mean profiles for each group and compare them with each other. The choice of summary measure depends on the question of interest; here I present simple summaries in the next two plots. Below, the mean and standard error (as a measure of variation) are calculated for each diet group:

# Calculating mean and standard error
# Number of time points
RATS$Group <- factor(RATS$Group)
n <- RATS$Time %>% unique() %>% length()
# Summarizing; standard error = sd / sqrt(n)
RATSsum <- RATS %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(rats), se = sd(rats)/sqrt(n) ) %>%
  ungroup()

# Plot the mean profiles
ggplot(RATSsum, aes(x = Time, y = mean, linetype = as.factor(Group), shape = as.factor(Group))) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  scale_y_continuous(name = "mean(rats) +/- se(rats)") 

In the summary profiles we see that the groups are quite different from each other, with the greatest variation (standard error) in the second group, the highest values in the third group, and the lowest values with the smallest standard error in the first group. Clearly each diet just made the rats heavier over time. Next, to compare the groups better, I will draw boxplots to spot outliers and remove them so they won’t bias the further comparisons.

# Summary
RATStime <- RATS %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(rats) ) %>%
  ungroup()

ggplot(RATStime, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(rats), days")

We can see from the boxplots above that in each group there is one outlier. I will try to remove at least two of them and see what kind of difference it makes.

RATStime1 <- filter(RATStime, mean > 250 & mean < 550)

ggplot(RATStime1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(rats), days")

I removed two of the outliers, which decreased the variance within the groups, but this alone gives no real evidence of group differences. Next I will run a formal test on the data to assess the differences between the group means. In DataCamp a two-sided t-test was performed, but since there are three groups here, I found ANOVA to suit best.

res.aov <- aov(mean ~ Group, data = RATStime1)
# Summary of the analysis
summary(res.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 189317   94659   611.4 5.32e-12 ***
## Residuals   11   1703     155                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not surprisingly, there is a significant difference between the group means. This is still a very shallow and simple analysis of the observations, and deeper digging would be required.
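ANOVA only tells us that at least one group mean differs from the others; to see which pairs of diet groups differ, a post-hoc test could be run on the fitted object. A sketch, assuming the `res.aov` object from above:

```r
# Pairwise comparisons of the group means with Tukey's honest
# significant differences; adjusted p-values below 0.05 indicate
# pairs of groups whose mean weights differ.
TukeyHSD(res.aov)
```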

Part 3: BPRS

Part 3 examines the BPRS data by applying analyses from MABS (Vehkalahti and Everitt, 2019) Chapter 9: Analysis of Longitudinal Data II: Linear Mixed Effects Models for Normal Response Variables. We will use linear mixed models to account for the correlations in the data. To begin with, as with the RATS data, we ignore the fact that there are multiple observations of the same people.

ggplot(BPRS, aes(x = week, y = bprs, linetype = as.factor(subject))) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs))) 

Next I will fit a regression model by still ignoring the repeated-measures structure:

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRS)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.8817     2.2949  19.993   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment     0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

We can see that the treatment groups show no statistically significant difference, whereas the time variable ‘week’ does. A more formal analysis is to fit a random intercept model with the explanatory variables week and treatment. This allows the linear regression fit for each man to differ in intercept from the other men, by introducing a random effect for each subject.

# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.8817     2.4413  18.794
## week         -2.2704     0.2084 -10.896
## treatment     0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##           (Intr) week  
## week      -0.341       
## treatment -0.661  0.000

Results of the random intercept model are above. We can see, for example, that the standard deviation of the between-subject (man-to-man) intercepts is 6.885. Finally I will fit a random intercept and random slope model, which allows the linear regression fits for the individuals to differ not only in intercept but also in slope. This way it is possible to account both for the individual differences in the men’s bprs profiles and for the effect of time (source: DataCamp).

# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
#Summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8201  8.0511        
##           week         0.9608  0.9802   -0.51
##  Residual             97.4307  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.8817     2.5685  17.864
## week         -2.2704     0.2977  -7.626
## treatment     0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##           (Intr) week  
## week      -0.477       
## treatment -0.608  0.000
# ANOVA test
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA test results we can see that the p-value of the likelihood ratio test between BPRS_ref (random intercept) and BPRS_ref1 (random intercept and slope) is 0.026, which is significant at the 5 % level; the lower the value, the better the fit of the more complex model. The random intercept and slope model thus fits the data better.
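To see what the chosen model implies, the fitted values from BPRS_ref1 could be plotted with the same layout as the observed profiles; a sketch, assuming the BPRS data and the model object from above:

```r
# Fitted bprs values from the random intercept and slope model,
# drawn as individual profiles per treatment group
BPRS$fitted <- fitted(BPRS_ref1)

ggplot(BPRS, aes(x = week, y = fitted, linetype = as.factor(subject))) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(name = "fitted bprs")
```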